Information-Theoretic Approaches for Measuring the Structural Similarity of Semistructured Documents

نویسندگان

  • Sven Helmer
  • Nikolaus Augsten
  • Michael Böhlen
چکیده

We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on informationtheoretic concepts. Common to all approaches is a twostep procedure: first we extract and linearize the structural information from documents and then we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Measuring the Structural Similarity of Semistructured Documents Using Entropy

We propose a technique for measuring the structural similarity of semistructured documents based on entropy. After extracting the structural information from two documents we use either Ziv-Lempel encoding or Ziv-Merhav crossparsing to determine the entropy and consequently the similarity between the documents. To the best of our knowledge, this is the first true linear-time approach for evalua...

متن کامل

خوشه‌بندی فراابتکاری اسناد فارسی اِکس‌اِم‌اِل مبتنی بر شباهت ساختاری و محتوایی

Due to the increasing number of documents, XML, effectively organize these documents in order to retrieve useful information from them is essential. A possible solution is performed on the clustering of XML documents in order to discover knowledge. Clustering XML documents is a key issue of how to measure the similarity between XML documents. Conventional clustering of text documents using a do...

متن کامل

Exploiting Structural Similarity For Effective Web Information Extraction

In this paper we propose an architecture that exploit web pages stuctural information for the extraction of relevant information from them. In this architecture, a primary role played by a distance-based classification methodology is devised. Such a methodology is based on an efficient and effective technique for detecting structural similarities among semistructured documents, which significan...

متن کامل

SOME SIMILARITY MEASURES FOR PICTURE FUZZY SETS AND THEIR APPLICATIONS

In this work, we shall present some novel process to measure the similarity between picture fuzzy sets. Firstly, we adopt the concept of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Secondly, we develop some similarity measures between picture fuzzy sets, such as, cosine similarity measure, weighted cosine similarity measure, set-theoretic similar...

متن کامل

Measuring the Structural Similarity of Web-based Documents: A Novel Approach

Most known methods for measuring the structural similarity of document structures are based on, e.g., tag measures, path metrics and tree measures in terms of their DOM-Trees. Other methods measures the similarity in the framework of the well known vector space model. In contrast to these we present a new approach to measuring the structural similarity of web-based documents represented by so c...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011